Abstract Almost all neural computations involve making predictions. Whether 
an organism is trying to catch prey, avoid predators, or simply move through a 
complex environment, the data it collects through its senses can guide its actions 
only to the extent that it can extract from these data information about the future 
state of the world. An essential aspect of the problem in all these forms is that 
not all features of the past carry predictive power. Since there are costs
associated with representing and transmitting information, a natural hypothesis is that
sensory systems have developed coding strategies that are optimized to minimize 
these costs, keeping only a limited number of bits of information about the past 
and ensuring that these bits are maximally informative about the future. Another 
important feature of the prediction problem is that the physics of the world is
diverse enough to contain a wide range of possible statistical ensembles, yet not all
motion is probable. Thus, the brain might not be a generalized predictive machine; 
it might have evolved to specifically solve the prediction problems most common 
in the natural environment. This paper reviews recent results on predictive
coding and optimal predictive information in the retina and suggests approaches for
quantifying prediction in response to natural motion. 
1 Introduction

What are the predictable components of the input to an animal’s visual system in
its natural environment? While the characteristics of static images have been
explored in large image repositories [12, 13, 36, 43], less is known or measured
in the temporal domain [21, 12]. One interesting feature of scaling in static
images is the power-law distribution of spatial variation in local contrast
[12, 36]. This scaling implies that natural images are scale-free, displaying
the same basic structure on all length scales. Power-law behavior in the
frequency distribution of temporal fluctuations in total scene luminance has also
been observed in a variety of natural contexts, and scenes display slightly different
exponents depending on their specific content. This paper will review recent
attempts to connect natural motion statistics to efficient prediction in the visual
system, focusing on the retina. Tying temporal statistics of natural scenes to
neural prediction will reveal what types of motion the brain can efficiently represent
and therefore constrain the types of predictions the brain can perform.

The concept of efficient coding for prediction in the brain has been developed
in two main ways: via theories of predictive coding [4, 42, 35, 26, 5] that
eliminate redundancy in the temporal response of the brain, and through analytical
work to characterize the optimal trade-offs between representing the past and
future sensory input [10, 11] (via information bottleneck calculations [44, 16, 18]). In
this paper, we will review and relate these two approaches to neural coding in 
the retina, and propose methods for extending this work to the context of natural 
motion statistics. 

It has been shown that retinal ganglion cells (RGCs), the output cells of the
retina whose axons form the optic nerve, display a whole host of nonlinear
processing characteristics that may be connected to prediction. RGCs respond
differentially to object versus background motion [32]. Ganglion cells have also been
shown to code for a variety of motion features in ways that cannot be accounted
for by a simple receptive field picture of encoding. These include: motion
anticipation, the coding in the retina for the anticipated position of an object moving at
constant velocity; the omitted-stimulus response, in which ganglion cells fire
after the cessation of a sequence of visual flashes at the appropriate delay where
the next flash in the sequence would be expected [38]; and reversal responses,
where neurons in the retina fire a synchronous burst of activity after the reversal
of a moving bar, irrespective of their relative receptive field positions [39]. All of
these adaptive motion-processing features speak to the retina’s complexity as an
encoding device, and have a component of predictability of the future state of
visual stimuli. Recent work has also shown that the retina solves a general prediction
problem in a near-optimal way [33].

We will review this background material and discuss how extensions of this 
work could reveal how an organism’s ecological niche shapes its predictive
processing. In particular, it may be that the retinas of different species possess the
capacity to solve different suites of motion prediction problems tailored to their 
natural environment. Evolution may shape which problems are hardwired in the 
early visual system, and exploring that can uncover just how far the retina is able 
to tune its predictive power to the statistics of its inputs. 
2 Theories of optimal prediction

Sitting at the front end of the visual system and with a limited number of fibers to
transmit all the visual information the brain receives, the retina has long been
hypothesized to be an efficient and perhaps even optimal encoder of the visual world
[1, 26, 20]. This notion of efficiency dictates that the retina’s representation
of the visual inputs to the photoreceptor layer should be as lossless as possible,
given the number of cables the retina has along which to transmit information to
the brain and the fact that metabolic constraints limit the firing rate of neurons.
To make the best use of each fiber, these signals should be independent in space
and in time. Recent work from a variety of researchers expands on this simple
notion of efficiency. Natural inputs to the retina are non-Gaussian [21], the noise
spectrum in neural data is not white, retinal firing is certainly highly redundant
[34], and not all information about the input is equally relevant to the organism.
The concept of optimizing the predictive capacity of the retina assigns value to
particular bits of information: it says that compression is only successful when
the transmitted bits convey information about the future input [11]. The
information bottleneck method [44] is a way of defining relevant information, in this case
information about the future, as the distortion measure.


2.1 Efficient coding in the time domain: Predictive coding 

The efficient coding hypothesis states that all of the information about the input 
should be retained, while minimizing the entropy of the response [40]. If not all
bits of information about the input are retained, the problem can be formulated
using rate distortion theory [17],

$$\min_{p(r_t \mid s_t)} \; I(R_t; S_t) + \beta\, D(R_t, S_t), \qquad (1)$$

where D is the average distortion, R is the neural response to the stimulus S,
β is a Lagrange multiplier that sets the trade-off between rate and distortion, and
the minimization finds the lowest transmitted bit rate, given D.
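
As a rough numerical illustration of Eq. (1), the following sketch runs the standard Blahut-Arimoto iteration for a small discrete source. The binary source, the Hamming distortion, the value of β, and the function name `blahut_arimoto` are illustrative assumptions and are not taken from the work reviewed here.

```python
import numpy as np

def blahut_arimoto(p_s, dist, beta, n_iter=500):
    """Minimize I(R;S) + beta * <D> over encoders p(r|s) for a discrete source.

    p_s  : (n_s,) source probabilities
    dist : (n_s, n_r) distortion d(s, r)
    beta : trade-off parameter (information measured in nats inside the loop)
    Returns (rate in bits, average distortion).
    """
    n_s, n_r = dist.shape
    q_r = np.full(n_r, 1.0 / n_r)              # marginal over responses
    for _ in range(n_iter):
        # encoder update: p(r|s) proportional to q(r) * exp(-beta * d(s, r))
        w = q_r[None, :] * np.exp(-beta * dist)
        p_r_given_s = w / w.sum(axis=1, keepdims=True)
        # marginal update: q(r) = sum_s p(s) p(r|s)
        q_r = p_s @ p_r_given_s
    joint = p_s[:, None] * p_r_given_s
    mask = joint > 0
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.log2(p_r_given_s / q_r[None, :])
    rate = np.sum(joint[mask] * log_ratio[mask])
    avg_dist = np.sum(joint * dist)
    return rate, avg_dist

# Assumed toy problem: fair binary source with Hamming distortion.
p_s = np.array([0.5, 0.5])
dist = 1.0 - np.eye(2)
rate, avg_dist = blahut_arimoto(p_s, dist, beta=2.0)
print("rate (bits):", rate, "average distortion:", avg_dist)
```

Sweeping β traces out the rate distortion curve: a large β penalizes distortion heavily and yields a high bit rate, while a small β tolerates distortion and pushes the rate toward zero.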

The core concept in predictive coding, in the time domain, is that temporal
correlations in the output stream should be eliminated, so that only deviations
from the expected response, those that are ‘surprising’, are encoded. If the
input statistics are stationary, predictive coding aims to minimize the response of 
the system. The role of neurons in a predictive coding paradigm is to code for 
changes in response statistics, not the ongoing predictable events in a stationary 
world. 
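
A minimal sketch of this idea, assuming a stationary first-order autoregressive input and a one-step linear predictor (both chosen here only for illustration), shows that the transmitted residual is temporally decorrelated and has a much smaller variance than the input. The function names and the correlation value are hypothetical.

```python
import numpy as np

def ar1_signal(n, a, rng):
    """Stationary AR(1) signal x[t] = a * x[t-1] + noise, with unit variance."""
    x = np.empty(n)
    x[0] = rng.standard_normal()
    noise = rng.standard_normal(n) * np.sqrt(1.0 - a**2)
    for t in range(1, n):
        x[t] = a * x[t - 1] + noise[t]
    return x

def predictive_code(x, a):
    """Transmit only the deviation from the one-step linear prediction a * x[t-1]."""
    return x[1:] - a * x[:-1]

def lag1_correlation(y):
    """Correlation coefficient between y[t] and y[t+1]."""
    return np.corrcoef(y[:-1], y[1:])[0, 1]

rng = np.random.default_rng(0)
a = 0.9                                  # assumed temporal correlation of the input
x = ar1_signal(50_000, a, rng)
residual = predictive_code(x, a)

print("input  lag-1 correlation:", lag1_correlation(x))
print("output lag-1 correlation:", lag1_correlation(residual))
print("variance ratio output/input:", residual.var() / x.var())
```

In this toy setting the predictor is given the true input statistics; in a neural system the corresponding model would have to be learned or hardwired.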

Predictive coding has been postulated to be achieved through feedback
connections from higher areas onto sensory input streams [35, 29, 45, 5], and early
as well as recent work in the retina hypothesizes that feedforward adaptive
mechanisms at the sensory periphery may result in predictive coding. As such,
predictive coding is highly efficient, because redundancy in time is eliminated.
Mechanisms have been proposed by which the retina could implement predictive
coding, via inhibitory interactions at bipolar terminals [26]. Also, the work of
Deneve shows how predictive coding may be self-organized in neural networks.
2.2 Predictive information

Information theoretic treatments of prediction in the brain focus on defining not 
just the code that retains the most stimulus information for a given output bit rate, 
but the one that retains the most information about the future stimulus. This
addition of the notion of relevant information has sharpened discussions of early
sensory processing in the context of prediction [11]. The theory for retaining the
optimal amount of predictive information has been well-developed by Tishby and
colleagues [44, 16, 18], and leads to results that are not only elegant but also
testable. Recent experimental and theoretical work draws on these results and has shown that
the salamander retina may be optimized for prediction [33].

The efficient representation of predictive information that we see in [33] adds
the notion of relevant information to the classical ideas of efficient coding. The
simplest version of the efficient coding hypothesis is that the retina processes
visual inputs to remove redundancy, allowing the array of retinal ganglion cells to
make fuller use of their limited capacity to transmit information [4, 2, 1]. The
results in [33] suggest that the retina is not designed to represent all of the input light
patterns impinging on its photoreceptors, but instead to represent those parts of the 
input that are most predictive of the future. The retina clearly throws away some 
aspects of the input light patterns, but perhaps only those parts that are irrelevant 
for the task of prediction. 

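Information bottleneck approaches. The maximal amount of predictive information
a system can possibly encode can be found by solving the following information
bottleneck problem:

$$\min_{p(r_t \mid s_t,\, s_{t-\Delta t},\, \ldots)} \; I(R_t; S_{t,\, t-\Delta t,\, t-2\Delta t,\, \ldots}) - \beta\, I(R_t; S_{t+\Delta t,\, t+2\Delta t,\, \ldots}), \qquad (2)$$

where the first term is the information the response carries about the past stimulus,
the second is its information about the future stimulus, and β sets the trade-off
between them. This can also be understood as a rate distortion problem where the
distortion metric is the predictive information. Here we can see how predictive information and
predictive coding are really solving complementary optimization problems for
stationary stimuli.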
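
When past and future are jointly Gaussian, the bottleneck bound has a closed form. The sketch below evaluates the simplest scalar case, assuming a single past variable and a single future variable with correlation coefficient `rho`; the function names and the value `rho = 0.8` are illustrative assumptions and not the stimulus ensemble used in the experiments reviewed here.

```python
import numpy as np

def total_predictive_info(rho):
    """I(past; future) in bits for two jointly Gaussian scalars with correlation rho."""
    return -0.5 * np.log2(1.0 - rho**2)

def gaussian_ib_bound(i_past, rho):
    """Maximum I(R; future) in bits given I(R; past) = i_past (scalar Gaussian case).

    The bound is reached by R = past + independent Gaussian noise,
    where the noise level sets i_past.
    """
    i_past = np.asarray(i_past, dtype=float)
    # fraction of the variance of R that comes from the past stimulus
    signal_fraction = 1.0 - 2.0 ** (-2.0 * i_past)
    return -0.5 * np.log2(1.0 - rho**2 * signal_fraction)

rho = 0.8
i_past = np.linspace(0.0, 5.0, 6)
for ip, ifut in zip(i_past, gaussian_ib_bound(i_past, rho)):
    print(f"I_past = {ip:.1f} bits -> I_future <= {ifut:.3f} bits")
print("saturation value:", total_predictive_info(rho))
```

The curve rises steeply at first and then saturates at the total predictive information in the stimulus, so the first few bits kept about the past buy the most information about the future.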

Providing an efficient representation of predictive information is nearly the
opposite of what one would expect from neurons doing predictive coding. In that type
of code, signals are decorrelated in time so that predictable components are
eliminated and neurons encode only the deviations from expectation, or surprise. The
responses of neurons implementing a truly optimal predictive code in a stationary
input environment would thus carry no predictive information about their own
responses. In contrast, recent results [33] suggest that the retina has a large amount
of response predictive information, and that these responses efficiently separate 
predictive from non-predictive bits, and transmit the predictive bits preferentially. 

Predictive coding does not explicitly preserve information about the future
stimulus. As such, it is hard to compare to optimal predictive information schemes.
The prescription for predictive coding is as follows: if $S$ is the input, $R^*$ is the
expected response, and $R$ is the observed response, then respond when $p(r \mid s)$
differs strongly from $p(r^* \mid s)$. Somewhere in the brain must live the model(s)
that generate $r^*$. Higher
cortical areas that inhibit early sensory areas, such as the lateral geniculate
nucleus, might provide precisely this type of feedback, and have been implicated in
experimental evidence for predictive coding. In the retina, with little
to no feedback from any downstream area, these models must be wired into the
retinal circuit.

Coding for surprising stimuli. Predictive coding and optimal predictive information
are mostly opposed processing theories. We illustrate this by way of a few toy
examples. If we imagine a world that is wholly static, there is nothing to predict,
no predictive information, and a predictive coder would have no response. If the
world is instead completely stochastic, however, predictive coding would dictate
that all noise signals that deviate strongly from the prior on the input generate a
response, since each one is unexpected and therefore surprising. These surprising
inputs are, however, uninformative about the future, since they are a pure noise
signal. The predictive information present is zero and no response modulation should
be encoded. Predictive coding is, however, designed explicitly for non-stationary 
stimuli. Predictive information optimization would code for a surprising change 
in the inputs, since that will have maximum information about the future state. 
In that sense the two methods are aligned and preserving predictive information 
preferentially over other bits of past information is ‘efficient’. 

Where models of the input statistics are stored. Predictive coding, in its clear
formulation by Rao and Ballard, postulates that a set of models of the input statistics
is present in higher-order areas that feed back onto sensory areas to suppress the
response to predictable events [35]. Recent work has elegantly shown how this can
be implemented in a hierarchical (Bayesian) framework. With limited
feedback impinging on the retina, however, the retina itself must store the models of
the input statistics it will receive. It has been demonstrated how adaptive gain
mechanisms in the retina, which have presumably been encoded over evolutionary
time, can instantiate predictive coding in the retina [26]. To further test these ideas
in the context of natural scene statistics, one needs to define the set of motion 
models an organism encounters in its natural environment. 


3 LNP models fail to capture motion processing in the retina 

In many contexts, our simplest models of retinal ganglion cell firing fail to
recapitulate the actual response properties of the retina. This is most clearly demonstrated
for motion stimuli. So-called LN, or linear-nonlinear, models of neural processing
fail to recapitulate the motion processing properties of the retina for a variety of
moving stimuli, including motion anticipation, the reversal response, object
motion detection, and the omitted-stimulus response. Sometimes complex adaptive
gain control mechanisms need to be added to these models to explain the motion
processing of the retina. In LN models, the probability of spiking is an
instantaneous, nonlinear function of a linearly filtered version of the sensory input. In the
case of retinal ganglion cells that we study here, the inputs are the image or
contrast as a function of space and time, $s(x,t)$. Thus, if we write the probability per
unit time of a spike (the firing rate), we have


$$r_{\mathrm{LN}}(t) = r_0\, g(z), \qquad (3)$$

where $r_0$ sets the scale of firing rates, $g(z)$ is a dimensionless nonlinear function,
and

$$z(t) = \int dx \int d\tau \, f(x, \tau)\, s(x, t - \tau), \qquad (4)$$

where the function $f(x, \tau)$ is the receptive field.
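
The LN model can be sketched in a few lines. The example below discretizes the linear filtering step, applies a logistic nonlinearity, and draws Poisson spike counts; the logistic form of g, the separable receptive field, the white-noise stimulus, and all parameter values are assumptions chosen for illustration, not fits to retinal data.

```python
import numpy as np

def linear_drive(stim, rf):
    """z[t] = sum_x sum_tau rf[x, tau] * stim[x, t - tau].

    stim : (n_x, n_t) stimulus contrast s(x, t)
    rf   : (n_x, n_tau) receptive field f(x, tau), tau = 0 is the present
    Samples before the start of the stimulus are treated as zero.
    """
    n_x, n_t = stim.shape
    z = np.zeros(n_t)
    for x in range(n_x):
        # full convolution in time, truncated to the causal part
        z += np.convolve(stim[x], rf[x], mode="full")[:n_t]
    return z

def ln_rate(stim, rf, r0=20.0, gain=1.0, theta=0.0):
    """Firing rate r(t) = r0 * g(z), with a logistic g (an assumed nonlinearity)."""
    z = linear_drive(stim, rf)
    return r0 / (1.0 + np.exp(-gain * (z - theta)))

def poisson_spikes(rate, dt, rng):
    """Draw spike counts per time bin from an inhomogeneous Poisson process."""
    return rng.poisson(rate * dt)

rng = np.random.default_rng(0)
n_x, n_t, n_tau = 8, 2000, 15
stim = rng.standard_normal((n_x, n_t))               # placeholder white-noise stimulus
spatial = np.exp(-0.5 * ((np.arange(n_x) - 3.5) / 1.5) ** 2)
temporal = np.exp(-np.arange(n_tau) / 4.0)
rf = np.outer(spatial, temporal)
rf /= np.linalg.norm(rf)                              # keep z of order one
rate = ln_rate(stim, rf)                              # spikes per second
spikes = poisson_spikes(rate, dt=1 / 60, rng=rng)     # counts per 1/60 s bin
print("mean rate (Hz):", rate.mean(), "total spikes:", spikes.sum())
```

Because the output at each time depends only on the instantaneous value of z, a model of this form has no mechanism for adaptation or gain control.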

If we deliver stimuli that are drawn from a Gaussian white noise ensemble,
then

$$f(x, \tau) \propto \left\langle s(x, t - \tau)\, \delta(t - t_{\mathrm{spike}}) \right\rangle, \qquad (5)$$

where $t_{\mathrm{spike}}$ is the time of a spike and $\langle \cdots \rangle$ denotes an average
over the stimulus ensemble. If we take the LN model derived from the random checkerboard
stimuli and use it to produce neural responses to the moving bar stimulus, the
predictive information carried by the model neurons is drastically wrong. In recent
reviews from Gollisch and Meister [22] and Berry and Schwartz, the myriad
ways in which a simple linear-nonlinear-Poisson (LNP) model fails to reproduce
known retinal response properties are described in detail. When simple bars of
light move across the retina, reverse their path, blink on and off, or move in more
complex ways, many kinds of nonlinear processes in the retina are activated. None
of these effects are captured by this basic version of the LN model for retinal
firing. Additionally, results in [32] reveal that the LN model fails to recapitulate the
near-optimal behavior of the real data, as illustrated for a slightly different LN
model here in Figure 1.
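
The spike-triggered average used to estimate the receptive field from white noise can be sketched as follows. For brevity the example assumes a time-only stimulus, an exponential nonlinearity, and a made-up biphasic filter; none of these choices come from the recordings discussed here.

```python
import numpy as np

def spike_triggered_average(stim, spikes, n_tau):
    """Estimate f(tau) as the spike-weighted average of stim[t - tau].

    Only time bins with a full stimulus history of n_tau samples are used.
    """
    n_t = len(stim)
    start = n_tau - 1
    weights = spikes[start:]
    sta = np.empty(n_tau)
    for tau in range(n_tau):
        # pair each spike count at time t with the stimulus tau bins earlier
        sta[tau] = np.dot(weights, stim[start - tau : n_t - tau])
    return sta / weights.sum()

rng = np.random.default_rng(1)
n_t, n_tau = 200_000, 20
tau = np.arange(n_tau)
f_true = np.exp(-tau / 4.0) * np.sin(2 * np.pi * tau / 12.0)   # assumed biphasic filter
f_true *= 0.7 / np.linalg.norm(f_true)

stim = rng.standard_normal(n_t)                       # Gaussian white noise
z = np.convolve(stim, f_true, mode="full")[:n_t]      # causal linear filtering
spikes = rng.poisson(0.1 * np.exp(z))                 # LNP spike counts per bin

sta = spike_triggered_average(stim, spikes, n_tau)
similarity = np.corrcoef(sta, f_true)[0, 1]
print("correlation between STA and true filter:", similarity)
```

The recovery works here because the stimulus is Gaussian and white; for correlated or naturalistic stimuli the raw spike-triggered average is a biased estimate of the filter.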

All model groups fall away from the bound determined by $\Delta t = 1/60$ s, the delay
between the current response and the onset of the common future. When we
compute information about the future, we assume that the future starts now, and do
not make any allowances for processing delays. We could, instead, compare the
performance of the LN model with bounds calculated assuming that there is a
delay between past and future, so that $\Delta t^* = \Delta t + t_{\mathrm{delay}}$. We choose
$t_{\mathrm{delay}} = 117$ ms, comparable to the delay one might estimate from the
peak of the information about position, or from the structure of the receptive fields
themselves. Interestingly, the model neurons do come close to this less restrictive
bound.

Of course, these model cells might not be optimal at any delay; instead, they
could fail to represent all of the predictable components of the stimulus, such as
the velocity. Real salamander RGCs have a delay of at least 50 ms, as measured by
the time to the peak firing rate induced by a flash. The data reveal that the retina has 
a mechanism that allows it to saturate the bound on the predictive information with 
almost zero effective processing delay when responding to a predictable moving 
stimulus. 

This work only scratches the surface of optimal coding for the future stimulus
in the retina. The motion statistics used here were chosen to include
two time scales of predictable motion as well as a purely stochastic motion
component, while still being soluble via Gaussian information bottleneck techniques.
Important extensions of this work will test other parametric motion models
with different statistics, but one of the most important directions of future research 
here is to explore motion models that mimic the properties of natural scenes. 
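
One simple motion model with two deterministic time scales (an oscillation period and a damping time) plus a stochastic drive is a noise-driven damped harmonic oscillator. The sketch below simulates such a trajectory with a semi-implicit Euler step. Treating this as the equation of motion of the bar, as well as every parameter value and the function name `simulate_bar_trajectory`, is an assumption made here for illustration rather than a detail stated in the text above.

```python
import numpy as np

def simulate_bar_trajectory(n_steps, dt=1 / 60, omega=2 * np.pi * 1.5, gamma=5.0,
                            noise=1.0, x0=0.0, v0=0.0, rng=None):
    """Position and velocity of a noise-driven damped harmonic oscillator.

    omega : natural angular frequency (sets the oscillation time scale)
    gamma : damping rate (sets the decay time scale)
    noise : amplitude of the white-noise velocity kicks
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.empty(n_steps)
    v = np.empty(n_steps)
    x[0], v[0] = x0, v0
    kicks = noise * np.sqrt(dt) * rng.standard_normal(n_steps)
    for t in range(1, n_steps):
        accel = -omega**2 * x[t - 1] - gamma * v[t - 1]
        v[t] = v[t - 1] + accel * dt + kicks[t]     # update velocity first
        x[t] = x[t - 1] + v[t] * dt                 # then position, using the new velocity
    return x, v

rng = np.random.default_rng(2)
x, v = simulate_bar_trajectory(1200, rng=rng)      # 20 s sampled at 60 Hz
print("position std:", x.std(), "velocity std:", v.std())
```

Because position and velocity are jointly Gaussian under this model, the predictive information and the corresponding bottleneck bound can be computed from covariance matrices alone.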
Fig. 1 The predictive information present in groups of 5 retinal ganglion cells (RGCs, black
dots), as well as for model neurons fit with linear-nonlinear (LN) models (red dots). The bound
on the maximal amount of information about the future that a group response with a given amount of
information about the past stimulus can carry is denoted with the black line. The input is a moving bar
stimulus as in [33]. Some model groups have more information about the past because the LN
model does not capture the low-level noise in the response of the RGCs. Other model groups
have less information than observed in real data because they fail to capture the stimulus-driven response
of these cells. No model groups capture as much information about the future of the stimulus as
the core predictive groups of the cells in the retina.


4 Towards naturalistic motion stimuli 

Natural scenes have heavy-tailed distributions of many quantities of interest,
including intensity, contrast, and temporal modulation frequency [31, 21].
This means that there exists no single length scale in space or time one can use
to coarse-grain natural scenes without sacrificing large amounts of structure in
the data, and that potentially salient fluctuations exist on all scales. We illustrate
some basic statistics of natural scenes in Figure 2, taken from our own database of
natural movies.
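
Statistics of this kind can be computed with a few array operations. The sketch below block-averages a frame over N×N blocks and estimates a pixel-averaged temporal power spectrum from overlapping Hamming-windowed segments. The synthetic random-walk movie stands in for real footage, and the function names are hypothetical; only the window length, overlap, and frame rate follow the values quoted for Fig. 2.

```python
import numpy as np

def block_average(frame, n):
    """Average a 2D frame over non-overlapping n x n blocks (edges are cropped)."""
    h, w = frame.shape
    h, w = h - h % n, w - w % n
    blocks = frame[:h, :w].reshape(h // n, n, w // n, n)
    return blocks.mean(axis=(1, 3))

def temporal_power_spectrum(movie, fs=60.0, nperseg=256):
    """Pixel-averaged temporal power spectrum, Hamming windows with 50% overlap.

    movie : (n_t, n_y, n_x) array with n_t >= nperseg
    """
    n_t = movie.shape[0]
    pixels = movie.reshape(n_t, -1).astype(float)
    window = np.hamming(nperseg)
    step = nperseg // 2
    starts = range(0, n_t - nperseg + 1, step)
    power = np.zeros(nperseg // 2 + 1)
    for s in starts:
        seg = pixels[s:s + nperseg]
        seg = seg - seg.mean(axis=0)                  # remove each pixel's mean level
        spec = np.fft.rfft(seg * window[:, None], axis=0)
        power += (np.abs(spec) ** 2).mean(axis=1)     # average across pixels
    power /= len(starts) * fs * np.sum(window ** 2)   # normalize by window power and rate
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return freqs, power

rng = np.random.default_rng(3)
# Synthetic stand-in for a 20 s clip at 60 Hz: each pixel performs a random walk in time.
movie = np.cumsum(rng.standard_normal((1200, 8, 8)), axis=0)
freqs, power = temporal_power_spectrum(movie)
coarse = block_average(movie[0], 2)
print("frequency range (Hz):", freqs[0], "to", freqs[-1])
```

Plotting `power` against `freqs` on log-log axes makes any power-law regime appear as a straight line whose slope is the exponent.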

To fully test whether the retina conveys information to the brain in a way that 
is optimized for prediction, we must consider what prediction problems the retina 
evolved to solve. In particular, if the retina solves only a subset of the possible 
spatiotemporal prediction problems that could possibly confront its input, it should 
solve those that are present in the natural environment. It could even be the case
that the retina of each animal evolved to predict most efficiently
on the scales and correlation structures present in its own ecological niche. To
test this, we need a framework for quantifying the prediction problems present in 
natural visual stimuli. 


4.1 Quantifying optimal coding for natural motion 

It is not possible to measure the information content of a neuron’s response to
the pixel-by-pixel representation of an image sequence of any useful size. Proxies
for this calculation include computing information about time within a long and
ergodic stimulus sequence [14], but a better approach is to find some reduced and
parametrizable (and therefore sample-able) representation of the motion present
in the natural environment that retains the features relevant to the organism. The
heavy-tailed nature of natural scene statistics does not offer up a coarse-graining
length scale over which we might smooth our inputs. Instead, we are left with
the difficult task of quantifying and summarizing the key statistics of the natural
world. One useful extraction in the context of prediction is to track salient objects
in natural scenes and analyze the statistics of those trajectories. Another useful
experiment would be to show the retina of one animal object trajectories from a
different ecological niche and determine whether it fails at this predictive
computation while excelling at those drawn from its native environment.

Fig. 2 (a) Representative frames from 3 natural movies: bees on a honeycomb (a plate of glass
exposes the hive from the side); pack ice flowing in Lake Michigan; tree blowing in the wind. (b)
Contrast distribution for an ensemble of 8 natural movie clips of 20 s each, filmed at 60 Hz. The
contrast of each pixel is defined as $C = I/I_0$, where $I$ is the 8-bit intensity value for that pixel and
$I_0$ is chosen for each frame such that the average contrast for that frame is 0. Distributions were
estimated at different scales by averaging contrast over N×N blocks. Contrast distributions are
normalized to have unit variance. (c) Temporal power spectra for 8 natural movie clips. Spectra
were computed for each pixel using a sliding Hamming window of 256 frames with 50% overlap,
then averaged across pixels.


Higher-order statistics of motion. Deciding how to quantify a natural scene can be
challenging. The high dimensionality of natural inputs to the visual system means
that direct approaches to quantifying the information transmission are fraught with
sampling error pitfalls or completely impossible. Making educated guesses about
what features of natural motion to quantify, or searching directly for a
lower-dimensional representation of the structure of natural scenes, are two promising
approaches to this problem. Recent work has defined a set of motion primitives,
correlation structures in space and time that find their basis in early theories of
motion processing, and extend and complete a local motion structure. The Fourier,
non-Fourier, expander and glider components of local motion can be readily
computed from natural movies [27, 30]. This approach reveals that certain ratios of
these components may be prevalent in natural scenes [30]. The brain might make
use of this fact to tailor its motion processing to just these types of input. If so,
deviations from this natural ratio should lead to noisier, less efficient coding for
the future stimulus. Downstream of the retina, this could lead to motion perception
deficits.
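
Local space-time correlators of this kind are simple to compute. The sketch below evaluates the mean product of pixel values over a chosen set of space-time offsets for a binary movie with one spatial dimension. The specific offset sets, the drifting random pattern, and the function name `glider_correlation` are illustrative assumptions and are not the exact primitives defined in the cited work.

```python
import numpy as np

def glider_correlation(movie, offsets):
    """Mean product of pixel values at a set of space-time offsets.

    movie   : (n_t, n_x) array of +/-1 pixel values (one spatial dimension)
    offsets : iterable of (dt, dx) pairs defining the shape of the correlator
    Periodic boundaries are used in both space and time for simplicity.
    """
    prod = np.ones(movie.shape, dtype=float)
    for dt, dx in offsets:
        # after the roll, entry [t, x] holds movie[t + dt, x + dx]
        prod *= np.roll(movie, shift=(-dt, -dx), axis=(0, 1))
    return prod.mean()

rng = np.random.default_rng(4)
width = 256
base = rng.choice([-1, 1], size=width)
# Random binary pattern drifting rightward by one pixel per frame.
movie = np.stack([np.roll(base, t) for t in range(width)])

rightward = glider_correlation(movie, [(0, 0), (1, 1)])     # two-point, matched to the drift
leftward = glider_correlation(movie, [(0, 0), (1, -1)])     # two-point, opposite direction
three_point = glider_correlation(movie, [(0, 0), (0, 1), (1, 1)])  # one three-point example
print("rightward:", rightward, "leftward:", leftward, "three-point:", three_point)
```

Applying the same function to binarized patches of natural movies, with a family of offset sets, would give the relative strengths of pairwise and higher-order motion signals.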

Such local motion signatures only characterize part of the total motion signal
in natural scenes. Machine learning efforts have been launched to find
longer-range, collective components of natural image and motion statistics [15, 24, 25].
Work from these groups has shown that some physical models of long-range
fluctuations may be applicable to natural motion. This is exciting because it could 
lead to a generative model for such motion, opening up the possibility of more 
stringent, parametrized tests of optimal coding for motion in the retina. 


5 Conclusions 

Testing theories of optimal prediction in the visual stream requires an integration 
of existing theories of optimal coding in the retina and beyond, with a careful 
quantification of the motion statistics present in the natural environment. By
examining how well information processing in the brain is tuned to natural inputs,
we may discover new static and adaptive features of the predictive part of the 
neural code. 